Chocolate Sales Exploratory Data AnalysisΒΆ

Author: Krstoph LedererΒΆ

Data Set Source: KaggleΒΆ


Load the dataset from kaggle api utlizing kaggle secrets file stored in .kaggle folder i.e. Json file.ΒΆ

InΒ [828]:
import kaggle
import os
import json

# Load Kaggle API credentials from the kaggle.json file
kaggle_json_path = r'C:\Users\krist\OneDrive\Desktop\Kaggle Practice Data Set\kaggle.json'
with open(kaggle_json_path, 'r') as f:
    kaggle_credentials = json.load(f)

os.environ['KAGGLE_USERNAME'] = kaggle_credentials['username']
os.environ['KAGGLE_KEY'] = kaggle_credentials['key']

# Initialize Kaggle API
api = kaggle.KaggleApi()
api.authenticate()

# Example: Download a dataset
dataset = 'atharvasoundankar/chocolate-sales'  # Replace with the dataset you want to download
path = r'C:\Users\krist\OneDrive\Desktop\Kaggle Practice Data Set'  # Replace with your desired download path

# Download the dataset
api.dataset_download_files(dataset, path=path, unzip=True)

print(f"Dataset '{dataset}' downloaded to '{path}'")
Dataset URL: https://www.kaggle.com/datasets/atharvasoundankar/chocolate-sales
Dataset 'atharvasoundankar/chocolate-sales' downloaded to 'C:\Users\krist\OneDrive\Desktop\Kaggle Practice Data Set'

Look at the data via pandas daframes and import all need libraries for Explortory Data Analysis. Then Start Data Cleaning before we start to do Exploratory Data Analysis.ΒΆ

InΒ [829]:
%%HTML
<script src="require.js"></script>
InΒ [830]:
import pandas as pd
import matplotlib.pyplot as plt
import plotly.express as px
import plotly.io as pio
import chart_studio.plotly as py
from IPython.display import HTML
import seaborn as sns
import datetime as dt
import numpy as np
import altair as alt
import sklearn as sk

pio.renderers.default='notebook'

# Load the data
dfChocolateSales = pd.read_csv('Chocolate Sales.csv')

# Display all columns
pd.set_option('display.max_columns', None)

#Display the data (top 5 rows of data frame)
dfChocolateSales.head()
Out[830]:
Sales Person Country Product Date Amount Boxes Shipped
0 Jehu Rudeforth UK Mint Chip Choco 04-Jan-22 $5,320 180
1 Van Tuxwell India 85% Dark Bars 01-Aug-22 $7,896 94
2 Gigi Bohling India Peanut Butter Cubes 07-Jul-22 $4,501 91
3 Jan Morforth Australia Peanut Butter Cubes 27-Apr-22 $12,726 342
4 Jehu Rudeforth UK Peanut Butter Cubes 24-Feb-22 $13,685 184

In the table above we can see that Amount contains a $ sign and comma which mean it is a string and will need to be converted to a int so we can use it in our model.ΒΆ

InΒ [831]:
# Run  to lookinfo at all data types within the dataset
dfChocolateSales.info()
<class 'pandas.core.frame.DataFrame'>
RangeIndex: 1094 entries, 0 to 1093
Data columns (total 6 columns):
 #   Column         Non-Null Count  Dtype 
---  ------         --------------  ----- 
 0   Sales Person   1094 non-null   object
 1   Country        1094 non-null   object
 2   Product        1094 non-null   object
 3   Date           1094 non-null   object
 4   Amount         1094 non-null   object
 5   Boxes Shipped  1094 non-null   int64 
dtypes: int64(1), object(5)
memory usage: 51.4+ KB

In the above output we see that there are 5 objects and one integer. However, Date needs to formated as a datetime, and amount as a integer.ΒΆ

InΒ [832]:
# Clean the Data

# Amount column is formated as a string and needs to be converted to an integer
## Remove the $ and , characters and convert the column to an integer
dfChocolateSales['Amount'] = dfChocolateSales['Amount'].str.replace('$', '').str.replace(',', '').astype(int)

# Convert the Date column to a datetime object
dfChocolateSales['Date'] = pd.to_datetime(dfChocolateSales['Date'])

# Convert Dates To Months
dfChocolateSales['Date'] = dfChocolateSales['Date'].dt.to_period('M').astype(str)
C:\Users\krist\AppData\Local\Temp\ipykernel_33408\14258022.py:8: UserWarning:

Could not infer format, so each element will be parsed individually, falling back to `dateutil`. To ensure parsing is consistent and as-expected, please specify a format.


Create Python Class with functions that look at unique total, sum, average.ΒΆ

InΒ [833]:
# Create Class of Column Value Counts that contains functions for totals, mean, sum, etc...
class Column_value_counts:
    # Get the total unique values in a column
    def Unique_Totals(self, column):
        total_unique = pd.Series(column.unique()).count()
        return total_unique
    
    # Get Sums of Numeric Columns
    def Sum_Numeric_Columns(self, column):
        sum_column = column.sum()
        return sum_column
    
    # Get Group By Averaged columns
    def Group_by_AVG(self, df, group_by_column, avg_column):
        groupbyavg = df.groupby(group_by_column)[avg_column].mean().astype(int)
        return groupbyavg

# Create an instance of the class
column_value_counts = Column_value_counts()

# Print the results
print(f"There is a total of {column_value_counts.Unique_Totals(dfChocolateSales['Sales Person'])} sales people in the dataset representing {column_value_counts.Unique_Totals(dfChocolateSales['Country'])} countries.")
print(f"There are {column_value_counts.Unique_Totals(dfChocolateSales['Product'])} products in the dataset.")
print(f"The total boxes shipped is {column_value_counts.Sum_Numeric_Columns(dfChocolateSales['Boxes Shipped'])} and the average chocolate boxes sold by each sales person is:\n{column_value_counts.Group_by_AVG(dfChocolateSales, 'Sales Person', 'Boxes Shipped')}")
There is a total of 25 sales people in the dataset representing 6 countries.
There are 22 products in the dataset.
The total boxes shipped is 177007 and the average chocolate boxes sold by each sales person is:
Sales Person
Andria Kimpton         165
Barr Faughny           148
Beverie Moffet         184
Brien Boise            152
Camilla Castle         167
Ches Bonnell           156
Curtice Advani         153
Dennison Crosswaite    178
Dotty Strutley         190
Gigi Bohling           134
Gunar Cockshoot        155
Husein Augar           153
Jan Morforth           196
Jehu Rudeforth         168
Kaine Padly            161
Karlen McCaffrey       205
Kelci Walkden          161
Madelene Upcott        161
Mallorie Waber         145
Marney O'Breen         178
Oby Sorrel             175
Rafaelita Blaksland    126
Roddy Speechley        160
Van Tuxwell            133
Wilone O'Kielt         118
Name: Boxes Shipped, dtype: int64

Create a bar chart showing the total boxes sold for each choclate product.ΒΆ

InΒ [834]:
total_boxes_by_product = dfChocolateSales.groupby('Product')['Boxes Shipped'].sum().reset_index()

# Visualze the data in a bar chart using Altair
total_boxes_by_product_bar_chart = px.bar(
    total_boxes_by_product, 
    x='Product', 
    y='Boxes Shipped',
    color='Product', 
    title='Total Boxes Shipped by Product')
total_boxes_by_product_bar_chart.show()

Create Circle Chart showing Total Sales Amount by CountryΒΆ

InΒ [839]:
# Create ne df for chart
total_product_sales_amount_by_country = dfChocolateSales.groupby('Country')['Amount'].sum().reset_index()

# Create Choropleth Map
Total_Product_Sales_By_Country_Map = px.choropleth(
    total_product_sales_amount_by_country,
    locations='Country',
    locationmode='country names',
    color='Amount',
    hover_name='Country',
    color_continuous_scale=px.colors.sequential.Plasma,
    title='Total Product Sales by Country'
)

Total_Product_Sales_By_Country_Map.show()

Create line chart looking at total sales over time.ΒΆ

InΒ [836]:
# Create datframe for sales over time
dfsalesovertime = dfChocolateSales.groupby('Date')['Amount'].sum().reset_index()

#
Sales_over_time_line_chart = px.line(
    dfsalesovertime, 
    x='Date', 
    y='Amount', 
    title='Sales Over Time', 
    markers=True)


# Save the chart as an HTML file
Sales_over_time_line_chart

The highest amount in chocolate sales month over month was at the beginning of the year of Jannuary 2022 ($896.105k). While the lowest amount of chocolate sales month over month was in April 2022(674.051k).ΒΆ